[client] Fix ICE reconnection loop and buffer candidates before agent initialization#5805

Closed
MichaelUray wants to merge 3 commits into netbirdio:main from MichaelUray:fix/ice-guard-loop-and-candidate-buffer

Conversation

@MichaelUray
Contributor

@MichaelUray MichaelUray commented Apr 6, 2026

Summary

  • Guard Loop Fix: When both peers are behind the same NAT (same public IP), their connectivity guards fire simultaneously every ~12s, each sending a new offer with a new session ID. This cancels the other side's ICE agent before it can finish negotiating, creating an infinite reconnection loop. Fix: skip new offers while agentConnecting == true to let the current ICE attempt complete.
  • Candidate Buffering: Remote ICE candidates that arrive before the local ICE agent is created were silently dropped. Add a pendingCandidates buffer that collects early candidates and flushes them once the agent is ready.

Both fixes are tightly coupled — without the buffer, the guard loop fix alone would still miss early candidates.
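In sketch form, the two fixes interact roughly as follows. This is a minimal, illustrative Go sketch only: the names agentConnecting and pendingCandidates match the description above, but the reduced types and helpers are invented for illustration and are not the actual worker_ice.go code.

```go
package main

import (
	"fmt"
	"sync"
)

type candidate string

// workerICE is a reduced stand-in for the real worker: just enough
// state to show how the offer guard and the candidate buffer interact.
type workerICE struct {
	mu                sync.Mutex
	agentReady        bool   // stands in for w.agent != nil
	agentConnecting   bool   // true while the current ICE attempt runs
	sessionID         string // session of the attempt in progress
	pendingCandidates []candidate
}

// onNewOffer skips offers carrying a different session ID while the
// current ICE attempt is still connecting, breaking the guard-vs-guard
// loop. On agent creation it flushes any buffered early candidates.
func (w *workerICE) onNewOffer(remoteSessionID string) bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.agentConnecting && remoteSessionID != w.sessionID {
		return false // let the in-flight attempt finish
	}
	// (re)create the agent for the new session
	w.agentReady = true
	w.agentConnecting = true
	w.sessionID = remoteSessionID
	// flush candidates that arrived before the agent existed
	for _, c := range w.pendingCandidates {
		w.addToAgent(c)
	}
	w.pendingCandidates = nil
	return true
}

// onRemoteCandidate buffers candidates that arrive before agent
// creation instead of silently dropping them.
func (w *workerICE) onRemoteCandidate(c candidate) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if !w.agentReady {
		w.pendingCandidates = append(w.pendingCandidates, c)
		return
	}
	w.addToAgent(c)
}

func (w *workerICE) addToAgent(c candidate) { fmt.Println("added:", c) }

func main() {
	w := &workerICE{}
	w.onRemoteCandidate("host/udp4") // buffered: no agent yet
	w.onNewOffer("sess-A")           // creates agent, flushes buffer
	w.onNewOffer("sess-B")           // skipped: sess-A still connecting
	fmt.Println("session:", w.sessionID) // session: sess-A
}
```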

Checklist

  • Bug fix
  • Create tests that fail without the change: ICE connections between peers behind the same NAT consistently failed before this fix
  • Documentation not needed — internal ICE behavior change, no user-facing API change

By submitting this pull request, I confirm that I have read and agree to the terms of the Contributor License Agreement.

Related Issues

Fixes #3669 — Unable to connect to netbird peer on different LAN with the same public IP
Fixes #5672 — ICE Agent is not initialized yet on android app
Related #2703 — Multiple peers behind NAT getting relayed
Related #3339 — Same LAN cannot successfully P2P
Related #4031 — No P2P/direct connection, only relayed
Related #4225 — Netbird Self-Hosted P2P not working
Related #5670 — macOS ICE agent never initializes (arm64)

Summary by CodeRabbit

  • Bug Fixes
    • Buffer early remote connection candidates with a bounded cap to avoid lost candidates and limit memory use.
    • Expand and queue applicable server-reflexive candidates for later delivery instead of dropping them.
    • Prevent rapid reconnection churn by skipping replacement of a recently-started agent while it’s still connecting (adds a short protection window).
    • Flush buffered candidates into a newly created connection and clear the buffer on close to prevent stale data.

@CLAassistant

CLAassistant commented Apr 6, 2026

CLA assistant check
All committers have signed the CLA.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 6, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Added an agent loop-protection window and an agentStartTime field; OnNewOffer now avoids replacing an existing connecting ICE agent if it was created within the protection window and the remote SessionID differs. agentStartTime is set when a new agent is created.

Changes

Cohort / File(s) Summary
Worker ICE reconnection logic
client/internal/peer/worker_ice.go
Introduce agentLoopProtectionWindow constant and WorkerICE.agentStartTime field; set agentStartTime = time.Now() when creating a new ICE agent; in OnNewOffer skip replacing an agent that is currently connecting if its age < protection window and remote SessionID differs. No changes to candidate handling.

Sequence Diagram

sequenceDiagram
    participant Client
    participant WorkerICE
    participant Agent

    Client->>WorkerICE: OnNewOffer(remoteSessionID)
    WorkerICE->>WorkerICE: if w.agent != nil && agent.State == Connecting
    alt agent.SessionID != remoteSessionID and (now - w.agentStartTime) < agentLoopProtectionWindow
        WorkerICE-->>Client: return (skip replacing recent connecting agent)
    else
        WorkerICE->>Agent: Close old agent (if exists)
        WorkerICE->>Agent: Create new agent
        WorkerICE->>WorkerICE: w.agentStartTime = time.Now()
        Agent-->>WorkerICE: Agent enters Connecting
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • pappz
  • lixmal

Poem

🐰 I marked the agent's newborn time,
and told it, "Rest a little, friend of mine."
If it's too fresh and eager to race,
I'll skip the swap and hold its place.
Steady hops keep the loop in line.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Title check — ✅ Passed: The title directly addresses both main changes: fixing an ICE reconnection loop and buffering candidates before agent initialization, accurately reflecting the primary objectives.
  • Docstring Coverage — ✅ Passed: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Description check — ✅ Passed: PR description comprehensively covers the bug fix, rationale, related issues, testing, and documentation requirements with all template sections filled.


Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
client/internal/peer/worker_ice.go (1)

184-189: Consider adding a buffer size limit as a defensive measure.

The buffer could theoretically grow unbounded if candidates keep arriving without agent creation (e.g., due to signaling issues). While unlikely in practice, a simple cap would prevent memory accumulation in pathological cases.

♻️ Optional: Add buffer cap
+const maxPendingCandidates = 50
+
 func (w *WorkerICE) OnRemoteCandidate(candidate ice.Candidate, haRoutes route.HAMap) {
 	w.muxAgent.Lock()
 	defer w.muxAgent.Unlock()
 	w.log.Debugf("OnRemoteCandidate from peer %s -> %s", w.config.Key, candidate.String())
 	if w.agent == nil {
+		if len(w.pendingCandidates) >= maxPendingCandidates {
+			w.log.Warnf("pending candidate buffer full, dropping candidate: %s", candidate.Type())
+			return
+		}
 		w.log.Infof("ICE Agent not ready, buffering remote candidate: %s", candidate.Type())
 		w.pendingCandidates = append(w.pendingCandidates, candidate)
 		return
 	}

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: aabe6227-0bce-4e6d-abac-3b3c6358dde1

📥 Commits

Reviewing files that changed from the base of the PR and between decb5dd and 28e3a06.

📒 Files selected for processing (1)
  • client/internal/peer/worker_ice.go

@MichaelUray MichaelUray force-pushed the fix/ice-guard-loop-and-candidate-buffer branch from 28e3a06 to 64e91ae on April 6, 2026 16:16
@pappz pappz self-requested a review April 7, 2026 08:16
MichaelUray pushed a commit to MichaelUray/netbird that referenced this pull request Apr 7, 2026
Addresses coderabbitai review on PR netbirdio#5805: the pendingCandidates slice
could theoretically grow unbounded if remote candidates keep arriving
before the ICE agent is created (e.g. due to signaling issues). Add a
fixed cap of 50 candidates and drop further candidates with a warning
log to protect against memory accumulation in pathological scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/peer/worker_ice.go`:
- Around line 68-69: Buffered remote candidates stored in pendingCandidates
bypass the normal pipeline (candidateViaRoutes and shouldAddExtraCandidate),
causing different behavior depending on arrival timing; fix by applying the same
routing/filtering before buffering and by replaying buffered entries through the
same helper used by the live path: call candidateViaRoutes() and run
shouldAddExtraCandidate() (or the unified helper that encapsulates both) for
each candidate before pushing to pendingCandidates, and when draining
pendingCandidates later invoke the same AddRemoteCandidate() path used for live
arrivals so routed/filtered/synthetic-related-port candidates are handled
identically (update functions: pendingCandidates, candidateViaRoutes,
shouldAddExtraCandidate, AddRemoteCandidate).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 73288d66-2438-465a-b0b8-92d5cf6eed4e

📥 Commits

Reviewing files that changed from the base of the PR and between 28e3a06 and 387a2e3.

📒 Files selected for processing (1)
  • client/internal/peer/worker_ice.go

@pappz
Collaborator

pappz commented Apr 7, 2026

@MichaelUray Thank you for the PR.
Regarding the candidate buffering: I think this can be dropped.

The race you're protecting against — OnRemoteCandidate arriving before the agent is created — requires a very specific timing: the remote peer would need to create its agent, start gathering, discover a candidate, and signal it back, all before our handshaker goroutine reads the offer from its channel and calls OnNewOffer. That's microseconds (channel read) vs tens-to-hundreds of milliseconds (STUN/TURN network I/O for candidate gathering), so in practice this almost never happens.

If we ever wanted to truly fix this race, the correct approach would be to remove the go from engine.go:1612:

 // current:
 go conn.OnRemoteCandidate(candidate, e.routeManager.GetClientRoutes())
 // fix:
 conn.OnRemoteCandidate(candidate, e.routeManager.GetClientRoutes())

But of course it is a risky change.

@pappz
Collaborator

pappz commented Apr 7, 2026

The session ID mechanism exists for two specific reasons:

  1. Force renegotiation after silent disconnect — when a peer crashes and reconnects with a new session ID, the remote side must tear down its agent and start fresh.
  2. Duplicate offer suppression — same session ID on resend (ICE + relay paths) gets deduplicated.

By skipping all offers with a different session ID while agentConnecting == true, case 1 regresses. If the remote peer crashed and came back, the local peer ignores its new session ID and keeps trying the current (now dead) ICE attempt until it times out. Recovery gets delayed by the full ICE timeout (30s+).

MichaelUray pushed a commit to MichaelUray/netbird that referenced this pull request Apr 8, 2026
…andidates

Address coderabbitai review on PR netbirdio#5805: previously buffered candidates
bypassed candidateViaRoutes() (route-loop filter) and shouldAddExtraCandidate()
(extra server-reflexive helper), so a candidate arriving before the agent
existed could be added later in a state inconsistent with a candidate
arriving after the agent existed. Worse, the buffered path missed the
extra-srflx-with-related-port candidate that the live path adds.

Refactor OnRemoteCandidate to compute the full list of candidates that
should reach the agent (the routed-out filter and the optional extra
srflx) before deciding whether to buffer or add immediately. The same
filter and extension now apply unconditionally; the only difference
between the two paths is whether we hand the result to the agent right
now or stash it for the next flush.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
client/internal/peer/worker_ice.go (1)

238-247: Minor: Log message may undercount dropped candidates.

When the buffer is full, the return exits the function immediately. If toAdd contains multiple candidates (original + extra srflx), all remaining candidates in the slice are dropped but only one is logged.

This is a minor logging clarity issue and doesn't affect correctness.

💡 Optional improvement for logging accuracy
 	for _, c := range toAdd {
 		if len(w.pendingCandidates) >= maxPendingCandidates {
-			w.log.Warnf("pending candidate buffer full (%d), dropping candidate: %s",
-				maxPendingCandidates, c.Type())
+			w.log.Warnf("pending candidate buffer full (%d), dropping %d candidate(s)",
+				maxPendingCandidates, len(toAdd)-len(w.pendingCandidates)+maxPendingCandidates)
 			return
 		}

Or simply break out of the loop and log once after counting drops.
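That break-and-count variant could look like the following sketch. The cap constant follows the diff above, but bufferCandidates is a hypothetical helper introduced here purely to illustrate the pattern, not code from the PR.

```go
package main

import "fmt"

const maxPendingCandidates = 50

// bufferCandidates appends candidates until the cap is reached, then
// stops and reports how many were not buffered, so the caller can emit
// a single accurate warning instead of one log per overflowing call.
func bufferCandidates(pending, toAdd []string) ([]string, int) {
	dropped := 0
	for i, c := range toAdd {
		if len(pending) >= maxPendingCandidates {
			dropped = len(toAdd) - i // everything from index i on is dropped
			break
		}
		pending = append(pending, c)
	}
	return pending, dropped
}

func main() {
	pending := make([]string, 49) // buffer almost full
	toAdd := []string{"host", "srflx", "srflx-extra"}
	pending, dropped := bufferCandidates(pending, toAdd)
	if dropped > 0 {
		fmt.Printf("pending candidate buffer full (%d), dropped %d candidate(s)\n",
			maxPendingCandidates, dropped)
	}
	fmt.Println(len(pending)) // 50
}
```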

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/peer/worker_ice.go`:
- Around line 157-165: The current guard w.agentConnecting unconditionally skips
new offers while an ICE attempt is in progress, which avoids reconnection loops
but can delay recovery when a remote peer crashes and reconnects with a new
session; change this by making the blocking behavior conditional or timeboxed:
add a configurable short timeout or an early-abort check that compares the
incoming remoteOfferAnswer.SessionIDString() to the session involved in the
current ICE attempt and, if they differ, either abort the ongoing attempt
(invoke the existing ICE cancel/cleanup routine) or allow the new offer to
preempt after a configurable threshold (e.g., w.iceConnectingTimeout or a method
like abortConnectingIfNewSession(remoteSessionID)); update logs via
w.log.Infof/w.log.Warnf to record preemption/timeout decisions.
- Around line 180-184: The current call to NewICESessionID() logs an error but
leaves w.sessionID unchanged on failure; update the error handling so you do not
keep a stale ID: only assign w.sessionID = sessionID when err == nil, and on err
either set w.sessionID = "" (or a known-invalid value) or return an error from
the enclosing function; modify the block around NewICESessionID() so it checks
err, clears w.sessionID on failure (or returns the error) and logs the failure
with context before exiting.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8d2c91f1-409c-45ea-93ba-b1dd7570a51f

📥 Commits

Reviewing files that changed from the base of the PR and between 387a2e3 and c131459.

📒 Files selected for processing (1)
  • client/internal/peer/worker_ice.go

@MichaelUray MichaelUray force-pushed the fix/ice-guard-loop-and-candidate-buffer branch 3 times, most recently from e31c5de to 377093f on April 13, 2026 04:26
@MichaelUray
Contributor Author

Thanks @pappz for the thorough review!

I've reworked the PR based on your feedback:

Candidate buffering — removed. You're right that the race is practically impossible given the timing constraints (microsecond channel read vs hundreds of ms STUN/TURN I/O). All three buffering commits have been dropped.

Session-ID skip during agentConnecting — replaced with a time-windowed guard. Instead of unconditionally skipping offers while connecting, the new approach only skips offers with a different session ID during the first 3 seconds after agent creation. This addresses both concerns:

  • Loop prevention: The guard-vs-guard race happens within milliseconds of agent creation (both guards fire on the same ~3s tick). The 3s window covers this with plenty of margin.
  • Crash recovery: After 3s, offers with new session IDs are accepted normally. Since the ICE FailedTimeout is 6s, the worst-case crash-recovery delay is 3s (vs the full 6s+ that the unconditional approach would have caused).

The 3s window aligns with the guard's InitialInterval of 3s, so by the time the first regular guard retry fires, the protection window has just expired.

One factual note: the ICE FailedTimeout in this codebase is 6 seconds (see iceFailedTimeoutDefault in client/internal/peer/ice/agent.go), not 30s+.

Tested on a real Android device (Galaxy S26) with multiple peers in the same LAN (same public IP) — direct P2P connections establish reliably without the reconnection loop.

@MichaelUray MichaelUray force-pushed the fix/ice-guard-loop-and-candidate-buffer branch from 377093f to 7720145 on April 13, 2026 05:32
@MichaelUray
Contributor Author

Addressed the coderabbitai review on the candidate buffer path in commit 7720145 — buffered candidates now flow through the same candidateViaRoutes() and shouldAddExtraCandidate() pipeline as normal candidates, and the session ID error path no longer continues with a stale ID. Ready for another look @pappz.

@MichaelUray
Contributor Author

@pappz friendly ping — all review feedback has been addressed. Ready for another look whenever you have time. Thanks!

MichaelUray added a commit to MichaelUray/netbird that referenced this pull request Apr 16, 2026
…OnRemoteCandidate synchronously

Per pappz's review on PR netbirdio#5805:

1. Remove the candidate buffer entirely. The race between OnRemoteCandidate
   and OnNewOffer requires the remote peer to complete STUN gathering and
   signal a candidate back in microseconds (channel read time), which is
   far faster than the tens-to-hundreds of milliseconds of network I/O
   needed for candidate discovery. In practice this almost never happens.

2. Remove the 'go' prefix from conn.OnRemoteCandidate() in engine.go so
   candidates are processed synchronously in the signaling receive loop.
   This is the correct fix for the theoretical race: candidates now arrive
   in the same goroutine as offers, eliminating the ordering issue without
   needing a buffer.

The agentLoopProtectionWindow guard (3s) is kept as-is since it correctly
handles both the reconnection loop scenario (same-NAT peers) and crash
recovery (new session ID accepted after the window expires).
@MichaelUray
Contributor Author

@pappz Following up with detailed test results on both the candidate buffer and the synchronous OnRemoteCandidate change.

Candidate buffer: confirmed unnecessary

I deployed the current PR branch (without buffer, with synchronous OnRemoteCandidate) to a Cudy TR3000 router (OpenWrt, ARM64) and tested against multiple peers:

Peer                     Version           Lost candidates   Connection
dk20 (same LAN)          dev (this PR)     6                 P2P
uray-mic-dh (same LAN)   dev (this PR)     13                P2P
uray-mic-d4 (remote)     v0.67.1 (stock)   1040              Relayed

The "ICE Agent is not initialized yet" warning still appears (even with synchronous OnRemoteCandidate — see below), but P2P works regardless. The lost candidates are re-signaled on subsequent guard rounds. You were right that the buffer has no practical impact.

Synchronous OnRemoteCandidate: doesn't fix the race, but doesn't hurt

The race condition isn't at the goroutine level — it's at the signaling message level. The remote peer sends host candidates (instant, no STUN needed) before sending its answer. In the receiveMSG loop, CANDIDATE messages arrive before the ANSWER message:

10.598s - Router sends OFFER
10.765s - CANDIDATE message arrives (167ms later) - agent NOT yet created
10.786s - ANSWER message arrives (188ms later) - OnNewOffer creates agent

Host candidates don't need STUN/TURN — they're discovered instantly when the agent is created. So the remote peer signals them before signaling the answer. Making OnRemoteCandidate synchronous doesn't change this ordering since both messages are processed sequentially in the same receiveMSG goroutine.

That said, making it synchronous is still a reasonable cleanup (removes unnecessary goroutine spawn), so I've kept that change.

The 1040 lost candidates from uray-mic-d4

These are from the guard loop on the stock v0.67.1 client — it sends new offers every ~12-30s, each with a new session ID. Each time, the remote peer's host candidates arrive before the answer. The time-windowed guard (3s) on our side prevents the reconnection loop from stomping on our ICE attempt, but the old client keeps re-triggering. The lost candidates accumulate but don't prevent P2P once both sides have the fix.

uray-mic-d4 stays Relayed not because of lost candidates, but because it's behind a different NAT and its guard loop prevents ICE from completing (same root cause this PR fixes).

Current PR state

  • ✅ Time-windowed guard (agentLoopProtectionWindow = 3s) — prevents reconnection loop while allowing crash recovery
  • ✅ Candidate buffer removed
  • ✅ Synchronous OnRemoteCandidate
  • ✅ Session ID error handling improved

Ready for another look whenever you have time.

@pappz
Collaborator

pappz commented Apr 20, 2026

@MichaelUray
I checked your PR and I don't see evidence for the root cause of the issue. The offer/answer and candidate ordering is important, and as far as I can tell it's correct. The session ID mechanism was specifically designed to avoid unnecessary agent recreation. Two peers attempting to connect to each other at the same time should not get stuck in a loop.

Could you send a debug bundle from both peers where you highlight the exact sequence that demonstrates the problem you want to fix? Ideally I'd like to see the OnNewOffer lines with session IDs across 3-4 guard cycles on both sides, so we can trace where the session-ID dedup fails.

@MichaelUray
Contributor Author

Hi @pappz, here is the debug evidence you requested. The analysis below was captured on stock upstream/main at 3098f48b2 with only additive Infof log lines – no behavioral changes – so that the OnNewOffer / closeAgent / send/recv-offer sequence is visible with session IDs on both sides.

Reproduction environment

Two peers, same LAN, same public IP. IPv6 + IPv4 peer-to-peer UDP blocked at netfilter/WFP to force ICE FailedTimeout (relay/signal/management paths untouched). Without that block ICE succeeds fast enough that closeAgent never fires; the loop is the post-failure steady state.

PR #5805 – Session-ID Dedup Failure Evidence

Setup

  • Base code: upstream/main at 3098f48b2 with zero fix-logic, only
    TRACE-PR5805 Infof log lines added in OnNewOffer, closeAgent,
    sendOffer/sendAnswer, and RECV_OFFER/RECV_ANSWER. No behavioral
    changes – the instrumentation is purely additive (see the debug branch).
  • Two peers in the same LAN (same public IP, same /24):
    • dk20 – 192.168.91.120 – linux/amd64
    • w11-test1 – 192.168.91.193 – windows/amd64
  • Both peers have DisableIPv6Discovery: true and – since the kernel/UDPMux
    still binds IPv6 regardless – additional netfilter/WFP rules that drop
    any IPv6 UDP and IPv4 UDP directly between the two peer addresses.
    This forces ICE host-candidate checks to fail so that the FailedTimeout
    path (which rotates sessions via closeAgent) actually fires. Relay
    paths are untouched; traffic to the NetBird management/signal/relay
    servers is unaffected.

The "force ICE failure" step is important: on a happy-path same-LAN pair
(e.g. both peers with IPv6 or with free IPv4 UDP) ICE succeeds fast enough
that closeAgent never runs and the loop never starts. The loop is the
post-failure steady state.

Log fields

+NNNNms   peer        EVENT                  key=val ...
  • +NNNNms – milliseconds since the first captured event (both peers merged
    on a common monotonic axis; dk20 is in UTC, w11-test1 in +02:00, both
    normalised to UTC before merging).
  • SEND_OFFER / RECV_OFFER / RECV_ANSWER – handshaker signaling events.
  • OnNewOffer ENTER – the state snapshot at the beginning of OnNewOffer
    (incoming, local_before, stored_remote, agent_nil, connecting).
  • DECISION= – which branch was taken (create_first_agent / recreate /
    skip_dedup).
  • agent_created – new ICE agent installed, new session stored.
  • closeAgent – the connect() goroutine finished (ICE Failed/Disconnected/
    Closed); a fresh local session is generated and remoteSessionID is cleared.

Full chronological trace

Captured on stock upstream/main at 3098f48. The timeline starts at
the first SEND_OFFER; the loop is fully visible within 110 seconds.

+     0ms  dk20       SEND_OFFER local_session=e7737d235f
+    43ms  dk20       RECV_ANSWER remote_session=5a85359990
+    43ms  dk20       OnNewOffer ENTER incoming=5a85359990 local_before=e7737d235f stored_remote= agent_nil=true connecting=false
+    43ms  dk20       DECISION=create_first_agent incoming=5a85359990 local=e7737d235f
+    44ms  dk20       agent_created local=e7737d235f stored_remote=5a85359990 connecting=true

# --- ~12s later: ICE FailedTimeout fires on dk20, closeAgent runs ---
+ 12095ms  dk20       closeAgent old_local=e7737d235f new_local=e21d611f3a cleared_remote=5a85359990 session_changed=false

# dk20 sends offer with its NEW post-closeAgent local session
+ 20554ms  dk20       SEND_OFFER local_session=e21d611f3a
+ 20596ms  dk20       RECV_ANSWER remote_session=5056316402
+ 20596ms  dk20       OnNewOffer ENTER incoming=5056316402 local_before=e21d611f3a stored_remote= agent_nil=true connecting=false
+ 20596ms  dk20       DECISION=create_first_agent incoming=5056316402 local=e21d611f3a
+ 20597ms  dk20       agent_created local=e21d611f3a stored_remote=5056316402 connecting=true

# Both sides' guards fire inside the same agent lifetime – the intra-agent
# dedup path is fine, these offers are ignored:
+ 22967ms  dk20       SEND_OFFER local_session=e21d611f3a
+ 24705ms  dk20       SEND_OFFER local_session=e21d611f3a
+ 28086ms  dk20       SEND_OFFER local_session=e21d611f3a

# Another ~12s, another closeAgent on dk20
+ 32625ms  dk20       closeAgent old_local=e21d611f3a new_local=bd2c5ce78d cleared_remote=5056316402 session_changed=false

+ 34926ms  dk20       SEND_OFFER local_session=bd2c5ce78d
+ 34927ms  dk20       RECV_OFFER remote_session=a8877b548b
+ 34927ms  dk20       OnNewOffer ENTER incoming=a8877b548b local_before=bd2c5ce78d stored_remote= agent_nil=true connecting=false
+ 34927ms  dk20       DECISION=create_first_agent incoming=a8877b548b local=bd2c5ce78d
+ 34927ms  dk20       SEND_ANSWER local_session=bd2c5ce78d
+ 34927ms  dk20       agent_created local=bd2c5ce78d stored_remote=a8877b548b connecting=true

# This is the dedup working as intended – same-agent retransmit:
+ 34968ms  dk20       OnNewOffer ENTER incoming=a8877b548b local_before=bd2c5ce78d stored_remote=a8877b548b agent_nil=false connecting=true
+ 34968ms  dk20       DECISION=skip_dedup reason=session_matches incoming=a8877b548b stored_remote=a8877b548b
+ 34968ms  dk20       RECV_ANSWER remote_session=a8877b548b

+ 45782ms  dk20       SEND_OFFER local_session=bd2c5ce78d
+ 47010ms  dk20       closeAgent old_local=bd2c5ce78d new_local=ca6e8d0d7b cleared_remote=a8877b548b session_changed=false
+ 47388ms  dk20       SEND_OFFER local_session=ca6e8d0d7b
+ 50813ms  dk20       SEND_OFFER local_session=ca6e8d0d7b

# --- w11-test1 side becomes visible (Windows netbird.exe buffers early lines) ---
+ 52914ms  w11-test1  SEND_OFFER local_session=3e3d2cf6c1
+ 52947ms  w11-test1  RECV_ANSWER remote_session=ca6e8d0d7b
+ 52947ms  w11-test1  OnNewOffer ENTER incoming=ca6e8d0d7b local_before=3e3d2cf6c1 stored_remote= agent_nil=true connecting=false
+ 52947ms  w11-test1  DECISION=create_first_agent incoming=ca6e8d0d7b local=3e3d2cf6c1
+ 52955ms  w11-test1  agent_created local=3e3d2cf6c1 stored_remote=ca6e8d0d7b connecting=true

+ 53714ms  dk20       RECV_OFFER remote_session=3e3d2cf6c1
+ 53714ms  dk20       OnNewOffer ENTER incoming=3e3d2cf6c1 local_before=ca6e8d0d7b stored_remote= agent_nil=true connecting=false
+ 53714ms  dk20       DECISION=create_first_agent incoming=3e3d2cf6c1 local=ca6e8d0d7b
+ 53714ms  dk20       SEND_ANSWER local_session=ca6e8d0d7b
+ 53714ms  dk20       agent_created local=ca6e8d0d7b stored_remote=3e3d2cf6c1 connecting=true

# Both peers now have lockstep ~12s lifetimes, rotating in sync:
+ 64998ms  w11-test1  closeAgent old_local=3e3d2cf6c1 new_local=22a0af9a10 cleared_remote=ca6e8d0d7b session_changed=false
+ 65798ms  dk20       closeAgent old_local=ca6e8d0d7b new_local=f5b6935722 cleared_remote=3e3d2cf6c1 session_changed=false

+ 68923ms  w11-test1  RECV_OFFER remote_session=f5b6935722
+ 68923ms  w11-test1  OnNewOffer ENTER incoming=f5b6935722 local_before=22a0af9a10 stored_remote= agent_nil=true connecting=false
+ 68923ms  w11-test1  DECISION=create_first_agent incoming=f5b6935722 local=22a0af9a10
+ 68932ms  w11-test1  SEND_ANSWER local_session=22a0af9a10
+ 68932ms  w11-test1  agent_created local=22a0af9a10 stored_remote=f5b6935722 connecting=true

+ 69689ms  dk20       SEND_OFFER local_session=f5b6935722
+ 69732ms  dk20       RECV_ANSWER remote_session=22a0af9a10
+ 69732ms  dk20       OnNewOffer ENTER incoming=22a0af9a10 local_before=f5b6935722 stored_remote= agent_nil=true connecting=false
+ 69732ms  dk20       DECISION=create_first_agent incoming=22a0af9a10 local=f5b6935722
+ 69733ms  dk20       agent_created local=f5b6935722 stored_remote=22a0af9a10 connecting=true

+ 81013ms  w11-test1  closeAgent old_local=22a0af9a10 new_local=838fdcd663 cleared_remote=f5b6935722 session_changed=false
+ 81776ms  dk20       closeAgent old_local=f5b6935722 new_local=c5df0189a9 cleared_remote=22a0af9a10 session_changed=false

+ 96180ms  w11-test1  RECV_OFFER remote_session=c5df0189a9
+ 96180ms  w11-test1  OnNewOffer ENTER incoming=c5df0189a9 local_before=838fdcd663 stored_remote= agent_nil=true connecting=false
+ 96180ms  w11-test1  DECISION=create_first_agent incoming=c5df0189a9 local=838fdcd663
+ 96189ms  w11-test1  SEND_ANSWER local_session=838fdcd663
+ 96189ms  w11-test1  agent_created local=838fdcd663 stored_remote=c5df0189a9 connecting=true

+ 96947ms  dk20       SEND_OFFER local_session=c5df0189a9
+ 96988ms  dk20       RECV_ANSWER remote_session=838fdcd663
+ 96988ms  dk20       OnNewOffer ENTER incoming=838fdcd663 local_before=c5df0189a9 stored_remote= agent_nil=true connecting=false
+ 96988ms  dk20       DECISION=create_first_agent incoming=838fdcd663 local=c5df0189a9
+ 96989ms  dk20       agent_created local=c5df0189a9 stored_remote=838fdcd663 connecting=true

+108266ms  w11-test1  closeAgent old_local=838fdcd663 new_local=2b6ed58608 cleared_remote=c5df0189a9 session_changed=false
+109038ms  dk20       closeAgent old_local=c5df0189a9 new_local=452cfe710b cleared_remote=838fdcd663 session_changed=false

Session-ID timeline (7 cycles in 109s)

Cycle   dk20 local session   w11-test1 local session
  0     e7737d235f           5a85359990
  1     e21d611f3a           5056316402
  2     bd2c5ce78d           a8877b548b
  3     ca6e8d0d7b           3e3d2cf6c1
  4     f5b6935722           22a0af9a10
  5     c5df0189a9           838fdcd663
  6     452cfe710b           2b6ed58608

Every ~12s (= iceFailedTimeoutDefault) both peers rotate their session
IDs, and the next offer/answer the other side receives carries that rotated
session.

Where session-ID dedup fails

Every OnNewOffer ENTER log in the steady-state loop shows

stored_remote=  agent_nil=true  connecting=false

i.e. the dedup check

if w.remoteSessionID == *remoteOfferAnswer.SessionID {
    // skip
}

can never match: w.remoteSessionID has been reset to the empty string by
the previous closeAgent (worker_ice.go:338), while the incoming
SessionID is the remote's freshly rotated value. Dedup is not failing
because of a bug in its comparison – it is simply never reachable, because
both sides have closeAgent-cleared remoteSessionID and a rotated
local session at the exact moment the next offer arrives.

That is the core of the loop:

  1. closeAgent on A – remoteSessionID="", new sessionID.
  2. closeAgent on B – remoteSessionID="", new sessionID.
  3. A's next SEND_OFFER uses the new local session; B receives it, sees
    empty stored-remote, has agent_nil=true, so it goes to
    create_first_agent and stores the new remote.
  4. In parallel, B's SEND_OFFER does the same thing to A.
  5. Both agents start the 12s ICE attempt, both fail, both rotate, back to
    step 1.

Because the rotation on both sides is triggered by the same global signal
(the ICE FailedTimeout of the previous round, which fires within a few
hundred milliseconds of each other), the cycles stay phase-locked for as
long as ICE keeps failing. In the captured run the two closeAgent events
are 763 ms, 800 ms, 772 ms, and 772 ms apart across four cycles.

Does the session-ID mechanism fulfil its two documented purposes?

Yes, both still work inside a single agent lifetime. The merged trace at
+34927ms/+34968ms shows the "duplicate offer on ICE+relay paths"
dedup firing correctly (DECISION=skip_dedup reason=session_matches). And
the "crash-recovery after a silent remote restart" path is preserved – a
remote with a new session ID still tears down the old agent. The loop is
specifically the simultaneous case where both sides rotate inside the
same ICE failed-timeout window.

Current PR #5805 behaviour

With agentLoopProtectionWindow applied, the first rotation on side A
will still happen (A cannot see B's session rotation in advance), but
within the protection window A refuses to tear its freshly-rebuilt agent
down a second time. That one-sided stability is enough to let the guard
settle, because the other side then does observe a stable remote
session on its next offer and its dedup path kicks in normally. After the
protection window expires, the "new session ID replaces current agent"
semantics are restored, so crash-recovery is delayed by at most
agentLoopProtectionWindow (currently 3 s), not by the ICE FailedTimeout.

Files in this bundle

  • dk20.log – raw TRACE-PR5805 lines from dk20, UTC.
  • w11-test1.log – raw TRACE-PR5805 lines from w11-test1, +02:00.
  • merged.log – both peers merged on a common UTC axis, offsets in ms
    from the first event.
  • analysis.md – this document.

The instrumentation patch (against upstream/main at 3098f48) and the
reproduction scripts are on branch debug/ice-session-trace-for-pr5805.

@pappz
Collaborator

pappz commented Apr 24, 2026

Thank you for your analysis. I tried to reproduce the issue to confirm the original report. In a lab environment, I introduced 1, 1.5, 3, etc. second delays in the connection. The P2P connection always succeeded and never got stuck in a loop. Could you please provide instructions on how I can repeat your tests?

MichaelUray and others added 3 commits April 25, 2026 09:43
…public IP

When two peers are behind the same NAT (same public IP), ICE P2P
connections could never be established because of a reconnection loop:

1. Guard on peer A sees no P2P connection and sends a new offer
2. Peer B receives the offer with a new session ID
3. Peer B cancels its active ICE agent and creates a new one
4. Before the new ICE agent can complete connectivity checks,
   peer B's guard sends a new offer to peer A
5. Peer A cancels its ICE agent → cycle repeats indefinitely

The fix adds a short protection window (3 seconds) after agent creation
during which new offers with different session IDs are skipped. This is
long enough to prevent the guard-vs-guard race (guards fire every 3s)
but short enough to not delay crash recovery — the ICE FailedTimeout
is only 6 seconds, so at worst recovery is delayed by 3s.

This replaces the earlier unconditional skip approach based on
reviewer feedback (pappz): unconditional blocking during agentConnecting
would delay crash-recovery by the full ICE timeout (6s+). The
time-windowed approach preserves the session-ID mechanism for crash
detection while eliminating the reconnection loop.

Candidate buffering (previously in this PR) has been removed per
reviewer feedback — the race condition it protected against is
practically impossible given the timing constraints.

Fixes: netbirdio#3669
Related: netbirdio#2703
Related: netbirdio#3339
…ssion ID handling

- Add pendingCandidates buffer for remote candidates arriving before ICE
  agent initialization (bounded to 50 entries)
- Route candidateViaRoutes check before buffering so routed candidates
  are filtered identically regardless of arrival timing
- Extract addRemoteCandidate helper used by both live and buffered paths
  ensuring shouldAddExtraCandidate logic runs for all candidates
- Clear stale session ID (set to "") when NewICESessionID() fails in
  handleExistingAgent and closeAgent instead of keeping previous value
- Clear pendingCandidates on agent close to prevent stale data

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…OnRemoteCandidate synchronously

Per pappz's review on PR netbirdio#5805:

1. Remove the candidate buffer entirely. The race between OnRemoteCandidate
   and OnNewOffer requires the remote peer to complete STUN gathering and
   signal a candidate back in microseconds (channel read time), which is
   far faster than the tens-to-hundreds of milliseconds of network I/O
   needed for candidate discovery. In practice this almost never happens.

2. Remove the 'go' prefix from conn.OnRemoteCandidate() in engine.go so
   candidates are processed synchronously in the signaling receive loop.
   This is the correct fix for the theoretical race: candidates now arrive
   in the same goroutine as offers, eliminating the ordering issue without
   needing a buffer.

The agentLoopProtectionWindow guard (3s) is kept as-is since it correctly
handles both the reconnection loop scenario (same-NAT peers) and crash
recovery (new session ID accepted after the window expires).
@MichaelUray MichaelUray force-pushed the fix/ice-guard-loop-and-candidate-buffer branch from d4d191b to 90b59d9 Compare April 25, 2026 15:43
@MichaelUray
Contributor Author

@pappz Thank you for the thorough review and the back-and-forth on this PR. Your feedback shaped the implementation considerably — dropping the candidate buffer, replacing the unconditional session-ID skip with a time-windowed guard, and pushing for proper repro evidence all made the change much more defensible than what I originally opened.

After today's comprehensive re-testing on real hardware, I'm closing this PR. Here's the analysis.

Test setup (2026-04-25)

Six Cudy TR3000 routers (OpenWrt 25.12.2-FWR, aarch64, kmod-wireguard / userspace WG) joined into one NetBird group with an all-to-all policy. UDP between same-NAT peer LAN IPs was blocked at nft to force the failure mode. DisableIPv6Discovery=true. Each round ran 90 s with --log-level debug.

Peer                   Site / WAN
5723C, 572A2, 57262    Graz home (same public NAT, 91.0/24 LAN)
5731A                  Graz home, LTE/CGNAT WAN
D95820, 572E2          Remote site (same public NAT, 0.0/24 LAN)

Three rounds, identical setup, only the binary swapped.

Results

Round  Binary                                              OnNewOffer (range across 6 routers)  ICE Failed  ICE Closed  recreate-agent  "skipping new offer" (#5805 marker)
1      vanilla 0.67.4                                      21 – 27                              23 – 27     23 – 27     21 – 27         0
2      this PR (pr5805-rebased on main, sha 90b59d944)     15 – 19                              18 – 20     18 – 20     15 – 19         0
3      upstream main d6f08e484 (#5820 merged, no #5805)    15 – 21                              19 – 20     19 – 20     15 – 21         0

ICE Connected was 0 in all rounds (UDP fully blocked, all paths relayed — that's by design; the test isolates the reconnect-loop dynamics, not the relay/p2p selection).

What this means

  1. Vanilla 0.67.4 reproduces the loop on every router, including the remote site — so the bug isn't environment-specific to my home setup.
  2. PR [client] Supress ICE signaling #5820 alone (your "Suppress ICE signaling" change merged on Apr 21) already cuts the churn by ~25 % and eliminates the lockstep character. This matches your hypothesis that the original report was driven by signaling-level amplification rather than the OnNewOffer-tear-down path.
  3. This PR on top of main shows no measurable additional reduction. Round 2 vs Round 3 are statistically indistinguishable with this sample.
  4. The agentLoopProtectionWindow (3 s) never fires in practice — the observed cycle period is consistently > 3 s, so the skipping new offer log line stays at 0 even with the fix in place. The defensive logic exists but doesn't engage in the scenarios I can reproduce.

Closing rationale

Given that:

  • the visible behavior (lockstep loop, session-ID rotation in stock 0.67.4) is already addressed by [client] Supress ICE signaling #5820,
  • the additional guard in this PR adds machinery without an observable effect in the test scenarios I have,
  • and you raised reasonable concerns about the session-ID mechanism semantics that I haven't fully resolved,

it doesn't make sense to keep this open as a candidate for merge. If the original lockstep pattern resurfaces on a future build, the right starting point will be a fresh investigation against then-current main, not a rebase of this branch.

For the lab repro you asked about

The reason artificial latency injection didn't reproduce the loop in your lab is that the trigger isn't latency — it's complete UDP loss between the two peer LAN endpoints (so all ICE candidates fail FailedTimeout) combined with both peers being on the same public NAT (so neither side ever falls back into a "different routes" branch that would break symmetry). Reproduction recipe for posterity:

# On both peers in the same LAN, behind the same NAT:
nft add table inet ice_test
nft 'add chain inet ice_test input { type filter hook input priority -100; }'
nft 'add chain inet ice_test output { type filter hook output priority -100; }'
nft "add rule inet ice_test input  ip saddr <other_peer_lan_ip> meta l4proto udp drop"
nft "add rule inet ice_test output ip daddr <other_peer_lan_ip> meta l4proto udp drop"

# Disable IPv6 discovery so it doesn't paper over the v4 failure:
sed -i 's|"DisableIPv6Discovery": false|"DisableIPv6Discovery": true|' /etc/netbird/config.json

On stock 0.67.4 you'll see lockstep OnNewOffer → ICE ConnectionState Failed → Closed → recreate ICE agent cycles every 12-22 s, with session IDs rotating. On current main (with #5820) the cycle still happens, but no longer in lockstep and at ~75 % of the rate.

Thanks again for sticking with this — both for the technical pushback that made me prove the claims properly, and for getting #5820 across the line.

— Michael
